We consider the problem of estimating the predictive density of future observations from a 
non-parametric regression model. The density estimators are evaluated under Kullback-Leibler 
divergence and our focus is on establishing the exact asymptotics of minimax risk in the case of 
Gaussian errors. We derive the convergence rate and constant for minimax risk among Bayesian 
predictive densities under Gaussian priors and we show that this minimax risk is asymptotically 
equivalent to that among all density estimators. 
1. Introduction 

Consider the canonical non-parametric regression setup 

$$Y(t_i) = f(t_i) + \sigma\varepsilon_i, \qquad i = 1,\dots,n, \tag{1.1}$$

where $f$ is an unknown function in $L^2[0,1]$, $t_i = i/n$ and the $\varepsilon_i$'s are i.i.d. standard 
Gaussian random variables. We assume that the noise level $\sigma$ is known and, without loss of 
generality, set $\sigma = 1$ throughout. 

Based on observing $Y = (Y(t_1),\dots,Y(t_n))$, estimating $f$ or various functionals of $f$ 
has been the central problem in non-parametric function estimation. The asymptotic 
optimality of estimators is usually associated with the optimal rate of convergence in terms 
of minimax risk. A huge body of literature has been devoted to the evaluation of minimax 
risks under $L^2$ loss over certain function spaces; see, for example, Pinsker [21], Ibragimov 
and Has'minskii [16], Golubev and Nussbaum [14], Efroimovich [8], Belitser and Levit 
[3, 4] and Goldenshluger and Tsybakov [13]. An excellent survey of the literature in this 
area can be found in Efromovich [9]. 



This is an electronic reprint of the original article published by the ISI/BS in Bernoulli, 
2010, Vol. 16, No. 2, 543-560. This reprint differs from the original in pagination and 
typographic detail. 



1350-7265 © 2010 ISI/BS 






X. Xu and F. Liang 



Sometimes, instead of estimating $f$ itself, one is interested in making statistical 
inference about future observations from the same process that generated $Y(t)$. A predictive 
distribution function assigns probabilities to all possible outcomes of a random variable. 
It thus provides a complete description of the uncertainty associated with a prediction. 
The minimaxity of predictive density estimators has been studied for finite-dimensional 
parametric models; see, for example, Liang and Barron [18], George, Liang and Xu [11], 
Aslan [2] and George and Xu [12]. However, so far, few results have been obtained on 
predictive density estimation for non-parametric models. The major thrust of this 
paper is to establish the asymptotic minimax risk for predictive density estimation under 
Kullback-Leibler loss in the context of non-parametric regression. Our result closely 
parallels the well-known work by Pinsker [21] for non-parametric function estimation under 
$L^2$ loss and provides a benchmark for studying the optimality of density estimates for 
non-parametric regression. 

Let $\tilde Y = (\tilde Y(u_1),\dots,\tilde Y(u_m))'$ denote a vector of future observations from model (1.1) 
at locations $\{u_j\}_{j=1}^m$. To evaluate the performance of density prediction across the whole 
curve, we assume that the $u_j$'s form an equally spaced dense (that is, $m \ge n$) grid in $[0,1]$. 
Given $f$, the conditional density $p(\tilde y|f)$ is a product of $N(\tilde y_j; f(u_j))$, where $N(\cdot;\mu)$ denotes 
a univariate Gaussian density function with mean $\mu$ and unit variance. Based on observing 
$Y = y$, we estimate $p(\tilde y|f)$ by a predictive density $\hat p(\tilde y|y)$, a non-negative function of $\tilde y$ 
that integrates to 1 with respect to $\tilde y$. 

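Common approaches to constructing $\hat p(\tilde y|y)$ include the "plug-in" rule that simply 
substitutes an estimate $\hat f$ for $f$ in $p(\tilde y|f)$, 

$$\hat p(\tilde y|\hat f) = \prod_{j=1}^m N(\tilde y_j; \hat f(u_j)), \tag{1.2}$$

and the Bayes rule that integrates $f$ with respect to a prior $\pi$ to obtain 

$$\hat p_\pi(\tilde y|y) = \int p(\tilde y|f)\,\pi(f|y)\,df = \frac{\int p(\tilde y|f)\,p(y|f)\,\pi(f)\,df}{\int p(y|f)\,\pi(f)\,df}. \tag{1.3}$$

We measure the discrepancy between $p(\tilde y|f)$ and $\hat p(\tilde y|y)$ by the average Kullback-Leibler 
(KL) divergence 

$$R(f,\hat p) = \frac{1}{m} E_{Y,\tilde Y|f} \log\frac{p(\tilde Y|f)}{\hat p(\tilde Y|Y)}. \tag{1.4}$$

Assuming that $f$ belongs to a function space $\mathcal F$, such as a Sobolev space, we are interested 
in the minimax risk 

$$R(\mathcal F) = \min_{\hat p}\max_{f\in\mathcal F} R(f,\hat p). \tag{1.5}$$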

It is worth observing that in this framework, the densities of future observations 
$(\tilde Y_1,\dots,\tilde Y_m)$ are estimated simultaneously by $\hat p(\tilde y|y)$. An alternative approach is to 
estimate the densities individually by $\{\hat p_j(\tilde y_j|y)\}_{j=1}^m$ with risk 

$$\frac{1}{m}\sum_{j=1}^m E_{Y,\tilde Y_j|f} \log\frac{p(\tilde Y_j|f)}{\hat p_j(\tilde Y_j|Y)}. \tag{1.6}$$

When the $u_j$'s are equally spaced and $m$ goes to infinity, the risk above converges to 

$$\int_0^1 E_{Y,\tilde Y(u)|f} \log\frac{p(\tilde Y(u)|f)}{\hat p_u(\tilde Y(u)|Y)}\,du,$$

which can be interpreted as the integrated KL risk of prediction at a random location $u$ 
in $[0,1]$. This individual prediction problem can be studied in our simultaneous prediction 
framework with $\hat p(\tilde y|y)$ restricted to a product form, that is, $\hat p(\tilde y|y) = \prod_{j=1}^m \hat p_j(\tilde y_j|y)$. For 
example, the plug-in estimator (1.2) has such a product form and it is easy to check 
that its individual estimation risk (1.6) is the same as its simultaneous estimation risk 
(1.4). In general, simultaneous prediction considers a broader class of $\hat p$ than the one 
considered by individual prediction. Therefore, simultaneous prediction is more efficient 
since the corresponding minimax risk (1.5) is less than or equal to the one with individual 
prediction. This is distinct from estimating $f$ itself under $L^2$ loss where, due to the 
additivity of $L^2$ loss, simultaneous estimation and individual estimation are equivalent. 

This paper is organized as follows. In Section 2, we show that the problem of predictive 
density estimation for a non-parametric regression model can be converted to the one 
for a Gaussian sequence model with a constrained parameter space. Direct evaluation of 
the minimax risk is difficult because of the constraint on the parameter space. Therefore, 
in Section 3, we first derive the minimax risk over a special class of $\hat p$ that consists of 
predictive densities under Gaussian priors on the unconstrained parameter space $\mathbb{R}^n$. 
Then, in Section 4, we show that this minimax risk is asymptotically equivalent to the 
overall minimax risk. Finally, in Section 5, we provide two explicit examples of minimax 
risks over $L^2$ balls and Sobolev spaces. 



2. Connection to Gaussian sequence models 

Let $\{\phi_i\}$ be the orthonormal trigonometric basis of $L^2[0,1]$, that is, 

$$\phi_0(t) = 1, \qquad \phi_{2k-1}(t) = \sqrt{2}\sin(2\pi k t), \qquad \phi_{2k}(t) = \sqrt{2}\cos(2\pi k t), \qquad k = 1,2,\dots$$

Then, $f = \sum_i \theta_i\phi_i$, where $\theta_i = \int_0^1 f(t)\phi_i(t)\,dt$ is the coefficient with respect to the $i$th 
basis element $\phi_i$. A function space $\mathcal F$ corresponds to a constraint on the parameter space 
of $\theta$. In this paper, we consider function spaces whose parameter spaces have ellipsoid 
constraints, that is, 

$$\Theta(C) = \Big\{\theta : \sum_{i=1}^{\infty} a_i^2\theta_i^2 \le C\Big\}, \tag{2.1}$$

where $a_1 \le a_2 \le \cdots$ and $a_n \to \infty$. 

We approximate $f$ by a finite summation $f_n = \sum_{i=1}^n \theta_i\phi_i$. The bias incurred by 
estimating $p(\tilde y|f_n)$ instead of $p(\tilde y|f)$ can be expressed as 

$$\mathrm{Bias}(f_n) = \frac{1}{m} E_{\tilde Y|f} \log\frac{p(\tilde Y|f)}{p(\tilde Y|f_n)} = \frac{1}{2m}\sum_{j=1}^m [f(u_j) - f_n(u_j)]^2 = \frac{1}{2}\sum_{i=n+1}^{\infty}\theta_i^2.$$

This bias is often negligible compared to the prediction risk (1.4); for example, it is of 
order $O(n^{-2\alpha})$ for Sobolev ellipsoids $\Theta(C,\alpha)$, as defined in (5.3). Therefore, from now 
on, we set $f = f_n$. 

Let $\theta = (\theta_1,\theta_2,\dots,\theta_n)'$, let $\Phi_A$ be an $n\times n$ matrix whose $(i,j)$th entry equals $\phi_j(t_i)$ and 
let $\Phi_B$ be an $m\times n$ matrix whose $(i,j)$th entry equals $\phi_j(u_i)$. Then, $Y|\theta$ and $\tilde Y|\theta$ are two 
independent Gaussian vectors with $Y|\theta \sim N(\Phi_A\theta, I_n)$ and $\tilde Y|\theta \sim N(\Phi_B\theta, I_m)$, where $I_n$ 
denotes the $n\times n$ identity matrix. Note that since the $t_i$'s and $u_j$'s are equally spaced, 
we have $\Phi_A'\Phi_A = nI_n$ and $\Phi_B'\Phi_B = mI_n$. Defining 

$$X = \frac{1}{n}\Phi_A'Y \quad\text{and}\quad \tilde X = \frac{1}{m}\Phi_B'\tilde Y, \tag{2.2}$$

it is then easy to check that $X$ and $\tilde X$ are independent and that 

$$X|\theta \sim N(\theta, v_nI_n) \quad\text{and}\quad \tilde X|\theta \sim N(\theta, v_mI_n), \tag{2.3}$$

where $v_n = 1/n$ and $v_m = 1/m$. We refer to the model above as a Gaussian sequence 
model since its number of parameters is increasing at the same rate as the number of 
data points. 
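
The reduction from the regression model to the sequence model can be checked numerically. The sketch below is only an illustration, not part of the original development: it assumes an odd $n$, so that the constant function together with the sine and cosine pairs up to frequency $(n-1)/2$ is exactly orthogonal on the grid $t_i = i/n$, and the helper names `trig_design` and `to_sequence` are hypothetical.

```python
import numpy as np

def trig_design(points, n_basis):
    """Design matrix whose (i, j) entry is the j-th basis function at points[i].

    Assumption of this sketch: column 0 is the constant function, followed by
    the pairs sqrt(2)*sin(2*pi*k*t), sqrt(2)*cos(2*pi*k*t) for k = 1, 2, ...
    n_basis must be odd so that every frequency has both members of its pair.
    """
    if n_basis % 2 == 0:
        raise ValueError("this sketch assumes an odd number of basis functions")
    cols = [np.ones_like(points)]
    for k in range(1, (n_basis - 1) // 2 + 1):
        cols.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * k * points))
        cols.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * k * points))
    return np.column_stack(cols)

def to_sequence(design, obs):
    """Map regression observations to sequence coordinates: design' obs / (#rows)."""
    return design.T @ obs / design.shape[0]

n, m = 9, 27                      # illustrative sizes with m >= n
t = np.arange(1, n + 1) / n       # observed design points
u = np.arange(1, m + 1) / m       # future design points
phi_a = trig_design(t, n)
phi_b = trig_design(u, n)

rng = np.random.default_rng(0)
theta = rng.normal(size=n)
y = phi_a @ theta + rng.normal(size=n)          # observed responses
y_future = phi_b @ theta + rng.normal(size=m)   # future responses
x = to_sequence(phi_a, y)                       # noise variance 1/n per coordinate
x_future = to_sequence(phi_b, y_future)         # noise variance 1/m per coordinate
```

Under these assumptions both Gram matrices are exact multiples of the identity, so the transformed coordinates have independent noise with variances $1/n$ and $1/m$.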

Consider the problem of predictive density estimation for the Gaussian sequence model 
(2.3). Let $\hat p(\tilde x|x)$ denote a predictive density function of $\tilde x$ given $X = x$. The incurred KL 
risk is defined to be 

$$R(\theta,\hat p) = \frac{1}{m} E_{X,\tilde X|\theta} \log\frac{p(\tilde X|\theta)}{\hat p(\tilde X|X)}$$

and the corresponding minimax risk is given by 

$$R(\Theta) = \inf_{\hat p}\sup_{\theta\in\Theta(C)} R(\theta,\hat p). \tag{2.4}$$

The following theorem states that the two minimax risks, the one associated with $(Y,\tilde Y)$ 
from a non-parametric regression model and the one associated with $(X,\tilde X)$ from a 
normal sequence model, are equivalent. 

Theorem 2.1. $R(\mathcal F) = R(\Theta)$, where $R(\mathcal F)$ is defined in (1.5) and $R(\Theta)$ in (2.4). 

Proof. See the Appendix. □ 
Remark. The idea of reducing a non-parametric regression model to a Gaussian 
sequence model via an orthonormal function basis has been widely used for non-parametric 
function estimation. Early references include Ibragimov and Has'minskii [15], Efromovich 
and Pinsker [10] and references therein. For recent developments, see Brown and Low [6], 
Nussbaum [19, 20] and Johnstone [17]. Our proof of Theorem 2.1, given in the Appendix, 
implies that simultaneous estimation of predictive densities in these two models is 
equivalent. However, this equivalence does not hold for the individual estimation 
approach described in Section 1 because the product form of the density estimators, that 
is, $\hat p(\tilde y|y) = \prod_j \hat p_j(\tilde y_j|y)$, is not retained under the transformation. 

3. Linear minimax risk 

Direct evaluation of the minimax risk (2.4) is difficult because the parameter space $\Theta(C)$ 
is constrained. In this section, we first consider a subclass of density estimators that have 
simple forms and investigate the minimax risk over this subclass. In the next section, we then 
show that the minimax risk over this subclass is asymptotically equivalent to the overall 
minimax risk $R(\Theta)$. Such an approach was first used in Pinsker [21] to establish a minimax 
risk bound for the function estimation problem. It inspired a series of developments, 
including Belitser and Levit [3, 4], Tsybakov [22] and Goldenshluger and Tsybakov [13]. 

Recall that in the problem of estimating the mean of a Gaussian sequence model under 
$L^2$ loss, diagonal linear estimators of the form $\hat\theta_i = c_ix_i$ play an important role. Indeed, 
Pinsker [21] showed that when the parameter space (2.1) is an ellipsoid, the minimax 
risk among diagonal linear estimators is asymptotically minimax among all estimators. 
Moreover, the results in Diaconis and Ylvisaker [7] imply that if such a diagonal linear 
estimator is Bayes, then the prior $\pi$ must be a Gaussian prior with a diagonal covariance 
matrix. Similarly, in investigating the minimax risk of predictive density estimation, we 
first restrict our attention to a special class of $\hat p$ that are Bayes rules under Gaussian 
priors over the unconstrained parameter space $\mathbb{R}^n$. Due to the above connection, we call 
these predictive densities linear predictive densities and call the minimax risk over this 
class the linear minimax risk, even though 'linear' does not have any literal meaning in 
our setting. 

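Under a Gaussian prior $\pi_S(\theta) = N(0,S)$, where $S = \mathrm{diag}(s_1,\dots,s_n)$ and $s_i \ge 0$ for 
$i = 1,\dots,n$, the linear predictive density $\hat p_S$ is given by 

$$\hat p_S(\tilde x|x) = \prod_{i=1}^n N\Big(\tilde x_i;\ \frac{s_i}{s_i+v_n}x_i,\ v_m + \frac{s_iv_n}{s_i+v_n}\Big), \tag{3.1}$$

where $N(\cdot;\mu,\sigma^2)$ here denotes the univariate Gaussian density with mean $\mu$ and variance $\sigma^2$. 

Note that $\hat p_S$ is not a Bayes estimator for the problem described in Section 2 because 
the prior distribution $N(0,S)$ is supported on $\mathbb{R}^n$ instead of on the ellipsoidal space $\Theta$. 
Nonetheless, $\hat p_S$ is a valid predictive density function. 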
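
As a small illustration of the predictive density under a diagonal Gaussian prior, the following sketch computes its coordinatewise mean and variance from the standard conjugate-normal update. The function name `linear_predictive_params` is hypothetical and the code is a sketch under the stated model, not the authors' implementation.

```python
import numpy as np

def linear_predictive_params(x, s, v_n, v_m):
    """Coordinatewise mean and variance of the Bayes predictive density of the
    future observation given x, under independent priors theta_i ~ N(0, s_i).

    x   : observed sequence coordinates, each with noise variance v_n
    s   : prior variances (s_i = 0 means a point mass at zero)
    v_m : noise variance of the future coordinates
    """
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    shrink = s / (s + v_n)        # posterior shrinkage factor s_i / (s_i + v_n)
    mean = shrink * x             # posterior mean of theta_i
    var = v_m + shrink * v_n      # posterior variance s_i*v_n/(s_i+v_n) plus future noise
    return mean, var

mean, var = linear_predictive_params([1.0, -2.0], [0.0, 1e12], v_n=0.1, v_m=0.05)
```

With $s_i = 0$ the prediction ignores the data and has variance $v_m$; as $s_i$ grows, the prediction approaches the uniform-prior predictive density centered at $x_i$ with variance $v_n + v_m$.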

The following lemma provides an explicit form of the average KL risk of $\hat p_S$. 
Lemma 3.1. The average Kullback-Leibler risk (1.4) of $\hat p_S$ is given by 

$$R(\theta,\hat p_S) = \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\Big[\log\frac{v_{n+m}+s_i}{v_n+s_i} + \frac{v_{n+m}+\theta_i^2}{v_{n+m}+s_i} - \frac{v_n+\theta_i^2}{v_n+s_i}\Big], \tag{3.2}$$

where $v_{n+m} = 1/(n+m)$. 



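Proof. Let $\hat p_U$ denote the posterior predictive density under the uniform prior $\pi_U \equiv 1$, 
namely, 

$$\hat p_U(\tilde x|x) = \frac{1}{[2\pi(v_n+v_m)]^{n/2}}\exp\Big\{-\frac{\|\tilde x-x\|^2}{2(v_n+v_m)}\Big\}.$$

Then, by [11], Lemma 2, the average KL risk of $\hat p_S$ is given by 

$$R(\theta,\hat p_S) = R(\theta,\hat p_U) - \frac{1}{m}E\log m_S(W;v_{n+m}) + \frac{1}{m}E\log m_S(X;v_n), \tag{3.3}$$

where 

$$W = \frac{v_mX + v_n\tilde X}{v_n+v_m} \sim N(\theta, v_{n+m}I)$$

and $m_S(x;\sigma^2)$ denotes the marginal distribution of $X|\theta \sim N_n(\theta,\sigma^2I)$ under the normal 
prior $\pi_S$. It is easy to check that 

$$R(\theta,\hat p_U) = \frac{1}{m}E\log\frac{p(\tilde X|\theta)}{\hat p_U(\tilde X|X)} = \frac{n}{2m}\log\frac{v_n}{v_{n+m}}, \tag{3.4}$$

$$\frac{1}{m}E\log m_S(W;v_{n+m}) = -\frac{1}{2m}\sum_{i=1}^n\log[2\pi(v_{n+m}+s_i)] - \frac{1}{2m}\sum_{i=1}^n\frac{v_{n+m}+\theta_i^2}{v_{n+m}+s_i} \tag{3.5}$$

and 

$$\frac{1}{m}E\log m_S(X;v_n) = -\frac{1}{2m}\sum_{i=1}^n\log[2\pi(v_n+s_i)] - \frac{1}{2m}\sum_{i=1}^n\frac{v_n+\theta_i^2}{v_n+s_i}. \tag{3.6}$$

The lemma then follows immediately by combining equations (3.3)-(3.6). □ 

We denote the linear minimax risk over all $\hat p_S$ by $R_L(\Theta)$, that is, 

$$R_L(\Theta) = \inf_S\sup_{\theta\in\Theta(C)} R(\theta,\hat p_S). \tag{3.7}$$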
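
The closed-form risk of Lemma 3.1 is easy to evaluate directly. The sketch below is illustrative only: the function name `linear_kl_risk` is hypothetical, and the formula coded is the one obtained by combining (3.3) through (3.6), with $v_n = 1/n$ and $v_{n+m} = 1/(n+m)$. It lets one confirm numerically that, for a fixed $\theta$, the risk is smallest when the prior variances equal $\theta_i^2$.

```python
import numpy as np

def linear_kl_risk(theta, s, n, m):
    """Average KL risk of the linear predictive density with prior variances s.

    theta and s must both have length n. The expression combines the
    uniform-prior risk with the two expected log marginal terms.
    """
    theta2 = np.asarray(theta, dtype=float) ** 2
    s = np.asarray(s, dtype=float)
    v_n, v_nm = 1.0 / n, 1.0 / (n + m)
    base = n / (2.0 * m) * np.log(v_n / v_nm)       # risk under the uniform prior
    terms = (np.log((v_nm + s) / (v_n + s))
             + (v_nm + theta2) / (v_nm + s)
             - (v_n + theta2) / (v_n + s))
    return base + terms.sum() / (2.0 * m)

rng = np.random.default_rng(1)
n, m = 6, 9
theta = rng.normal(size=n)
oracle_risk = linear_kl_risk(theta, theta**2, n, m)   # prior variances matched to theta
```

Taking very large prior variances recovers the uniform-prior risk $\frac{n}{2m}\log\frac{v_n}{v_{n+m}}$, while $\theta = 0$ with $s = 0$ gives zero risk.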



This linear minimax risk is not directly tractable because the inside maximization is over 
a constrained space $\Theta(C)$. In the following theorem, we first show that we can switch 
the order of inf and sup in equation (3.7) and then evaluate $R_L$ using the Lagrange 
multiplier method. 

The following notation will be useful throughout. Let $\lambda(C,v_n,v_{n+m})$ denote a solution 
of the equation 

$$\sum_{i=1}^n a_i^2\Big[(v_n-v_{n+m})\sqrt{1+\frac{4\lambda/a_i^2}{v_n-v_{n+m}}} - (v_n+v_{n+m})\Big]_+ = 2C, \tag{3.8}$$

where $[x]_+ = \sup(x,0)$, and let 

$$\breve\theta_i^2 = \frac{1}{2}\Big[(v_n-v_{n+m})\sqrt{1+\frac{4\lambda/a_i^2}{v_n-v_{n+m}}} - (v_n+v_{n+m})\Big]_+ \quad\text{for } i = 1,2,\dots,n. \tag{3.9}$$



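Theorem 3.2. Suppose that the parameter space $\Theta(C)$ is an ellipsoid, as defined in 
(2.1). The linear minimax risk is then given by 

$$R_L(\Theta) = \inf_S\sup_{\theta\in\Theta(C)} R(\theta,\hat p_S) = \sup_{\theta\in\Theta(C)}\inf_S R(\theta,\hat p_S) \tag{3.10}$$

$$\phantom{R_L(\Theta)} = \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\log\frac{v_{n+m}+\breve\theta_i^2}{v_n+\breve\theta_i^2}, \tag{3.11}$$

where $\breve\theta_i^2$ is defined as in (3.9). The linear minimax estimator $\hat p_V$ is the Bayes predictive 
density under a Gaussian prior 

$$\pi_V(\theta) = N(0,V), \quad\text{where } V = \mathrm{diag}(\breve\theta_1^2,\breve\theta_2^2,\dots,\breve\theta_n^2), \tag{3.12}$$

namely, 

$$\hat p_V(\tilde x|x) = N(\hat\theta_V,\Sigma_V),$$

with 

$$\hat\theta_V = \Big(\frac{\breve\theta_1^2}{\breve\theta_1^2+v_n}x_1,\dots,\frac{\breve\theta_n^2}{\breve\theta_n^2+v_n}x_n\Big)' \quad\text{and}\quad \Sigma_V = \mathrm{diag}\Big(\frac{\breve\theta_1^2v_n}{\breve\theta_1^2+v_n}+v_m,\dots,\frac{\breve\theta_n^2v_n}{\breve\theta_n^2+v_n}+v_m\Big).$$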



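Proof. We first prove equality (3.11). It is easy to check that for any fixed $\theta$, $R(\theta,\hat p_S)$ 
achieves its minimum at $S = \mathrm{diag}(\theta_1^2,\dots,\theta_n^2)$, and 

$$\inf_S R(\theta,\hat p_S) = \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\log\frac{v_{n+m}+\theta_i^2}{v_n+\theta_i^2}.$$

To calculate the maximum of the above quantity over $\theta\in\Theta(C)$, one needs to solve 

$$\max_\theta\ \sum_{i=1}^n\log\frac{v_{n+m}+\theta_i^2}{v_n+\theta_i^2} \quad\text{subject to}\quad \sum_{i=1}^n a_i^2\theta_i^2 \le C.$$

With the Lagrangian 

$$\mathcal L = \sum_{i=1}^n\log\frac{v_{n+m}+\theta_i^2}{v_n+\theta_i^2} - \frac{1}{\lambda}\Big(\sum_{i=1}^n a_i^2\theta_i^2 - C\Big),$$

simple calculation reveals that the maximum is attained at $\breve\theta_i^2$ given by (3.9). 

Next, we prove equality (3.10), that is, that the order of inf and sup can be exchanged. 
Note that for any diagonal matrix $\breve S$, we have 

$$\sup_{\theta\in\Theta(C)} R(\theta,\hat p_{\breve S}) \ge \inf_S\sup_{\theta\in\Theta(C)} R(\theta,\hat p_S) \ge \sup_{\theta\in\Theta(C)}\inf_S R(\theta,\hat p_S). \tag{3.13}$$

Therefore, if there exists an $\breve S$ such that 

$$\sup_{\theta\in\Theta(C)} R(\theta,\hat p_{\breve S}) - \sup_{\theta\in\Theta(C)}\inf_S R(\theta,\hat p_S) \le 0,$$

then all of the inequalities in (3.13) become equalities. 
If we let $\breve S = \mathrm{diag}(\breve\theta_1^2,\dots,\breve\theta_n^2)$, then 

$$R(\theta,\hat p_{\breve S}) - \sup_{\theta\in\Theta(C)}\inf_S R(\theta,\hat p_S) = \frac{1}{2m}\sum_{i=1}^n\frac{(v_n-v_{n+m})(\theta_i^2-\breve\theta_i^2)}{(v_n+\breve\theta_i^2)(v_{n+m}+\breve\theta_i^2)} = \frac{1}{2m\lambda}\Big(\sum_{i=1}^n a_i^2\theta_i^2 - C\Big),$$

where the second equality holds because $\sum_{i=1}^n a_i^2\breve\theta_i^2 = C$ and $\breve\theta_i^2$ is a solution to 

$$\frac{\partial\mathcal L}{\partial\theta_i^2} = \frac{v_n-v_{n+m}}{(v_n+\breve\theta_i^2)(v_{n+m}+\breve\theta_i^2)} - \frac{a_i^2}{\lambda} = 0.$$

Since $\theta\in\Theta(C)$ implies that $\sum_{i=1}^n a_i^2\theta_i^2 \le C$, we have 

$$\sup_{\theta\in\Theta(C)} R(\theta,\hat p_{\breve S}) - \sup_{\theta\in\Theta(C)}\inf_S R(\theta,\hat p_S) \le \frac{1}{2m}\cdot\frac{C-C}{\lambda} = 0,$$

which completes the proof. □ 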
Remark. Note that $a_1 \le a_2 \le \cdots$, so we have $\breve\theta_i^2 = 0$ for $i > N$, where 

$$N = \sup\Big\{i : a_i^2 \le \lambda\Big(\frac{1}{v_{n+m}} - \frac{1}{v_n}\Big) = m\lambda\Big\}. \tag{3.14}$$

Asymptotic minimax density prediction 






This implies that the prior distribution corresponding to the linear minimax estimator, 
that is, $\pi_V(\theta) = \prod_{i=1}^n N(0,\breve\theta_i^2)$, puts a point mass at zero for $\theta_i$ for all $i > N$. 
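
The multiplier $\lambda$ in (3.8) has no closed form in general, but it can be found numerically because the left-hand side of (3.8) is non-decreasing in $\lambda$. The following sketch uses plain bisection; the function names are hypothetical, the weights $a_i$ are assumed strictly positive, and this is one possible numerical route rather than a procedure taken from the paper.

```python
import numpy as np

def theta_breve_sq(lam, a, v_n, v_nm):
    """Prior variances of the least favorable Gaussian prior for multiplier lam."""
    d, s = v_n - v_nm, v_n + v_nm
    return 0.5 * np.maximum(d * np.sqrt(1.0 + 4.0 * lam / (a**2 * d)) - s, 0.0)

def solve_lambda(a, C, n, m, iters=200):
    """Find lam such that sum_i a_i^2 * theta_breve_i^2 = C, by bisection."""
    a = np.asarray(a, dtype=float)
    v_n, v_nm = 1.0 / n, 1.0 / (n + m)

    def budget(lam):
        return np.sum(a**2 * theta_breve_sq(lam, a, v_n, v_nm))

    lo, hi = 0.0, 1.0
    while budget(hi) < C:          # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if budget(mid) < C:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def linear_minimax_risk(a, C, n, m):
    """Return (linear minimax risk, prior variances, number of nonzero variances)."""
    a = np.asarray(a, dtype=float)
    v_n, v_nm = 1.0 / n, 1.0 / (n + m)
    lam = solve_lambda(a, C, n, m)
    t2 = theta_breve_sq(lam, a, v_n, v_nm)
    risk = (n / (2.0 * m) * np.log(v_n / v_nm)
            + np.sum(np.log((v_nm + t2) / (v_n + t2))) / (2.0 * m))
    return risk, t2, int(np.count_nonzero(t2))

# Illustration: equal weights a_i = 1 with m = n.
risk_ball, t2_ball, n_active = linear_minimax_risk(np.ones(50), C=2.0, n=50, m=50)
```

For equal weights and $m = n$ the solver returns $\breve\theta_i^2 = C/n$ for every $i$, and the risk equals $\frac12\log\frac{1+2C}{1+C}$ for every $n$.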

4. Asymptotic minimax risk 

In this section, we turn to establishing the asymptotic behavior of the minimax risk 
$R(\Theta)$ over all predictive density estimators. By definition, $R(\Theta) \le R_L(\Theta)$. We extend 
the approach in [3] to show that the difference between $R(\Theta)$ and $R_L(\Theta)$ vanishes as 
the number of observations $n$ goes to infinity. Therefore, the overall minimax risk is 
asymptotically equivalent to the linear minimax risk. This also implies that the Gaussian 
prior $\pi_V$ defined in (3.12) is asymptotically least favorable. 

The following lemma provides a lower bound for the overall minimax risk $R(\Theta)$ under 
some conditions. 

Lemma 4.1. Let $\{s_i^2\}_{i=1}^n$ be a sequence such that for some $a > 0$, 

$$\sum_{i=1}^n a_i^2s_i^2 + \Big(8a\log(1/v_n)\sum_{i=1}^n a_i^4s_i^4\Big)^{1/2} \le C. \tag{4.1}$$

Then, as $n\to\infty$, the minimax risk $R(\Theta)$ has the following lower bound: 

$$R(\Theta) \ge \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\log\frac{v_{n+m}+s_i^2}{v_n+s_i^2} + O(v_n^a).$$

Proof. See the Appendix. □ 

Note that, as shown in the proof, for a Gaussian prior $\pi_S = N(0,S)$, where 
$S = \mathrm{diag}(s_1^2,\dots,s_n^2)$, condition (4.1) guarantees that the prior has most of its 
mass inside $\Theta$, in the sense that $\pi_S(\Theta^c)$ is bounded by a positive power of $v_n$. 

With the lower bound in the above lemma, we are ready to prove the main result in 
this paper, which shows that the overall minimax risk $R(\Theta)$ is asymptotically equivalent 
to the linear minimax risk $R_L(\Theta)$. 

Theorem 4.2. Suppose that $\Theta$ is the ellipsoid defined in (2.1) and $\breve\theta_i^2$ is defined in (3.9). 
If $m = O(n)$ and 

$$\log(1/v_n)\sum_{i=1}^n a_i^4\breve\theta_i^4 = o(1) \quad\text{as } v_n\to 0, \tag{4.2}$$

then 

$$\lim_{v_n\to 0}\frac{R(\Theta)}{R_L(\Theta)} = 1. \tag{4.3}$$
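Proof. By definition, $R(\Theta) \le R_L(\Theta)$. So, to prove this theorem, it suffices to show that 
as $v_n\to 0$, 

$$R(\Theta) \ge R_L(\Theta)(1-o(1)).$$

For a fixed constant $a > 1$, let $\gamma = \frac{1}{C}\big[8a\log(1/v_n)\sum_{i=1}^n a_i^4\breve\theta_i^4\big]^{1/2}$ and let $b_i^2 = \breve\theta_i^2(1+\gamma)^{-1}$ 
for $i = 1,\dots,n$. It is easy to check that the sequence $\{b_i^2\}_{i=1}^n$ satisfies the condition 
(4.1). Therefore, by Lemma 4.1, 

$$R(\Theta) \ge \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\log\frac{v_{n+m}+b_i^2}{v_n+b_i^2} + O(v_n^a)$$

$$\phantom{R(\Theta)} = R_L(\Theta) - \frac{1}{2m}\sum_{i=1}^n\log\frac{(v_n+b_i^2)(v_{n+m}+\breve\theta_i^2)}{(v_{n+m}+b_i^2)(v_n+\breve\theta_i^2)} + O(v_n^a) \quad\text{as } v_n\to 0. \tag{4.4}$$

Next, we will derive the convergence rate of $R_L(\Theta)$ and show that the other terms are 
of smaller order. 

Using the fact that $\breve\theta_i^2 = 0$ for $i > N$ (see (3.14)), we can rewrite $R_L(\Theta)$ as 

$$R_L(\Theta) = \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^N\log\frac{v_{n+m}+\breve\theta_i^2}{v_n+\breve\theta_i^2} + \frac{n-N}{2m}\log\frac{v_{n+m}}{v_n} = \frac{1}{2m}\sum_{i=1}^N\log\Big(1+\frac{(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)v_{n+m}}\Big).$$

When $m = O(n)$, we have $v_n - v_{n+m} = O(v_n)$ and $v_n + v_{n+m} = O(v_n)$, so each ratio inside 
the logarithm is bounded. Therefore, by means of a Taylor expansion, 

$$R_L(\Theta) \asymp \frac{1}{m}\sum_{i=1}^N\frac{(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)v_{n+m}}. \tag{4.5}$$

Similarly, since $b_i^2 = \breve\theta_i^2 = 0$ for $i > N$, the second term in (4.4) can be written as 

$$\frac{1}{2m}\sum_{i=1}^n\log\frac{(v_n+b_i^2)(v_{n+m}+\breve\theta_i^2)}{(v_{n+m}+b_i^2)(v_n+\breve\theta_i^2)} = \frac{1}{2m}\sum_{i=1}^N\log\frac{(v_n+b_i^2)(v_{n+m}+\breve\theta_i^2)}{(v_{n+m}+b_i^2)(v_n+\breve\theta_i^2)}.$$

For every $1 \le i \le N$, we have 

$$\log\frac{(v_n+b_i^2)(v_{n+m}+\breve\theta_i^2)}{(v_{n+m}+b_i^2)(v_n+\breve\theta_i^2)} = \log\frac{[(1+\gamma)v_n+\breve\theta_i^2](v_{n+m}+\breve\theta_i^2)}{[(1+\gamma)v_{n+m}+\breve\theta_i^2](v_n+\breve\theta_i^2)}$$

$$= \log\Big(1+\frac{\gamma(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)(v_{n+m}+\breve\theta_i^2)+\gamma v_{n+m}(v_n+\breve\theta_i^2)}\Big) \le \log\Big(1+\frac{\gamma(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)v_{n+m}}\Big).$$

Again using a Taylor expansion, as well as the condition that $\gamma = o(1)$, we obtain 

$$\frac{1}{2m}\sum_{i=1}^N\log\Big(1+\frac{\gamma(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)v_{n+m}}\Big) = o\Big(\frac{1}{m}\sum_{i=1}^N\frac{(v_n-v_{n+m})\breve\theta_i^2}{(v_n+\breve\theta_i^2)v_{n+m}}\Big) = o(R_L(\Theta)). \tag{4.6}$$

Finally, since $m = O(n)$, by choosing $a > 1$, the last term in (4.4) satisfies 

$$\frac{O(v_n^a)}{R_L(\Theta)} = o(1). \tag{4.7}$$

Combining (4.4)-(4.7), the theorem then follows. □ 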

5. Examples 

In this section, we apply Theorems 3.2 and 4.2 to establish asymptotic behaviors of 
minimax risks over some constrained parameter spaces. In particular, we consider the 
asymptotics over $L^2$ balls and Sobolev ellipsoids. 

Example 1. Suppose that $m = n$ and $\theta$ is restricted to an $L^2$ ball, 

$$\Theta(C) = \Big\{\theta : \sum_{i=1}^n\theta_i^2 \le C\Big\}. \tag{5.1}$$

The $L^2$ ball can be considered as a variant of the ellipsoid (2.1) with $a_1 = a_2 = \cdots = a_n = 1$ 
and $a_{n+1} = a_{n+2} = \cdots = \infty$. Although the values of the $a_i$'s here depend on $n$, the proofs 
of the above theorems are still valid. It is easy to see that $N$ defined in (3.14) is equal to 
$n$ and that $\breve\theta_1^2 = \breve\theta_2^2 = \cdots = \breve\theta_n^2 = C/n$. Therefore, 

$$(\log n)\sum_{i=1}^n a_i^4\breve\theta_i^4 = (\log n)\frac{C^2}{n} = o(1).$$

By Theorem 4.2, the minimax risk among all predictive density estimators is 
asymptotically equivalent to the minimax risk among linear density estimators. Furthermore, by 
Theorem 3.2, 

$$\lim_{n\to\infty} R(\Theta(C)) = \lim_{n\to\infty} R_L(\Theta(C)) = \frac{1}{2}\log 2 + \frac{1}{2}\log\frac{1/(2n)+C/n}{1/n+C/n} = \frac{1}{2}\log\frac{1+2C}{1+C}.$$

Note that this minimax risk is strictly smaller than the minimax risk over the class of 
plug-in estimators since, for any plug-in density $p(\tilde x|\hat\theta)$, 

$$\frac{1}{n}E\log\frac{p(\tilde X|\theta)}{p(\tilde X|\hat\theta)} = \frac{1}{n}\cdot\frac{E\|\hat\theta-\theta\|^2}{2/n} = \frac{1}{2}E\|\hat\theta-\theta\|^2 \tag{5.2}$$

and by Pinsker's theorem, the minimax risk of estimating $\theta$ under squared error loss is 
$C/(1+C)$, which is larger than $\log\frac{1+2C}{1+C}$ by the fact that $x > \log(1+x)$ for any $x > 0$. 
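
The comparison in Example 1 can be written out in one line. Setting $x = C/(1+C) > 0$, so that $1 + x = (1+2C)/(1+C)$, gives

$$\frac{1}{2}\log\frac{1+2C}{1+C} = \frac{1}{2}\log(1+x) < \frac{1}{2}x = \frac{1}{2}\cdot\frac{C}{1+C},$$

where the left side is the limiting overall minimax risk and the right side is the limiting plug-in minimax risk obtained from (5.2) and Pinsker's theorem. As a purely illustrative value, $C = 1$ gives $\frac12\log\frac32 \approx 0.203$ against $\frac14 = 0.25$.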

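Example 2. Suppose that $m = n$ and $\theta$ is restricted to a Sobolev ellipsoid 

$$\Theta(C,\alpha) = \Big\{\theta : \sum_{i=1}^{\infty} a_i^2\theta_i^2 \le C\Big\}, \tag{5.3}$$

where $a_{2i} = a_{2i-1} = (2i)^\alpha$ ($\alpha > 0$) for $i = 1,2,\dots$ Then, by (3.14), we have $a_N^2/(\lambda n)\to 1$ and hence 
$N^{2\alpha}/(\lambda n)\to 1$ as $n\to\infty$. Substituting this relation into equation (3.8) yields 

$$C = \sum_{i=1}^N a_i^2\Big[\frac{1}{4n}\sqrt{1+\frac{8\lambda n}{a_i^2}} - \frac{3}{4n}\Big] = \frac{1}{4n}\sum_{i=1}^N i^{2\alpha}\big(\sqrt{1+8N^{2\alpha}i^{-2\alpha}} - 3\big)(1+o(1)).$$

Using the Taylor expansion 

$$\sqrt{1+8\Big(\frac{N}{i}\Big)^{2\alpha}} = \sum_{k=0}^{\infty}\frac{2\sqrt2(-1)^k(2k)!}{(1-2k)(k!)^2\,32^k}\Big(\frac{i}{N}\Big)^{(2k-1)\alpha}$$

and the asymptotic relation 

$$\sum_{i=1}^N i^r = \frac{N^{r+1}}{r+1}(1+o(1)) \quad\text{as } N\to\infty,\ r > -1,$$

we obtain 

$$N = Mn^{1/(2\alpha+1)}(1+o(1)) \quad\text{and}\quad \lambda = M^{2\alpha}n^{-1/(2\alpha+1)}(1+o(1)),$$

where 

$$M = \Bigg[\frac{4C}{\sum_{k=0}^{\infty}\frac{2\sqrt2(-1)^k(2k)!}{(1-2k)(k!)^2\,32^k}\cdot\frac{1}{(2k+1)\alpha+1} - \frac{3}{2\alpha+1}}\Bigg]^{1/(2\alpha+1)}.$$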
Therefore, 

$$\breve\theta_i^2 = \frac{1}{4n}\Big[\sqrt{1+8\Big(\frac{N}{i}\Big)^{2\alpha}} - 3\Big]_+(1+o(1))$$

and 

$$(\log n)\sum_{i=1}^n a_i^4\breve\theta_i^4 = O\Big((\log n)\cdot\frac{N^{4\alpha+1}}{n^2}\Big) = O\big((\log n)\cdot n^{-1/(2\alpha+1)}\big) = o(1).$$

By Theorem 4.2, the minimax risk among all predictive density estimators is asymptot- 
ically equivalent to the minimax risk among the linear density estimators. Furthermore, 
by Theorem 3.2, 



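$$R_L(\Theta(C,\alpha)) = \frac{1}{2}\log 2 + \frac{1}{2n}\sum_{i=1}^N\log\frac{1/(2n)+\breve\theta_i^2}{1/n+\breve\theta_i^2} + \frac{n-N}{2n}\log\frac{1}{2}$$

$$= \frac{1}{2n}\sum_{i=1}^N\log\frac{1/n+2\breve\theta_i^2}{1/n+\breve\theta_i^2} = \frac{1}{2n}\sum_{i=1}^N\log\Big(1+\frac{\breve\theta_i^2}{1/n+\breve\theta_i^2}\Big).$$

It is difficult to calculate an explicit form of the optimal constant for the minimax risk 
due to the log function, but we can get an accurate bound for it. By Taylor expansion, 
there exist $x_i^*\in(0,1)$, $i = 1,2,\dots,N$, such that 

$$R_L(\Theta(C,\alpha)) = \frac{1}{2n}\sum_{i=1}^N\frac{1}{1+x_i^*}\cdot\frac{\breve\theta_i^2}{1/n+\breve\theta_i^2}.$$

Moreover, 

$$\sum_{i=1}^N\frac{\breve\theta_i^2}{1/n+\breve\theta_i^2} = \sum_{i=1}^N\frac{\frac{1}{4n}\sqrt{1+8(N/i)^{2\alpha}} - \frac{3}{4n}}{\frac{1}{n}+\frac{1}{4n}\sqrt{1+8(N/i)^{2\alpha}} - \frac{3}{4n}}(1+o(1)) = KN(1+o(1)),$$

where 

$$K = 1 + \frac{1}{2(2\alpha+1)} - \frac{1}{2}\sum_{k=0}^{\infty}\frac{2\sqrt2(-1)^k(2k)!}{(1-2k)(k!)^2\,32^k}\cdot\frac{1}{(2k+1)\alpha+1}.$$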

Therefore, 

$$\lim_{n\to\infty} n^{2\alpha/(2\alpha+1)}R(\Theta(C,\alpha)) = \lim_{n\to\infty} n^{2\alpha/(2\alpha+1)}R_L(\Theta(C,\alpha)) \in \big(\tfrac14KM,\ \tfrac12KM\big), \tag{5.4}$$

that is, the convergence rate is $n^{-2\alpha/(2\alpha+1)}$ and the convergence constant is between 
$\tfrac14KM$ and $\tfrac12KM$. 

Figure 1. Convergence constants of the overall minimax risk (the lower red line) and the 
minimax risk over the class of plug-in estimators (the upper red line). Here, the sample size 
$n = 10000000$ and $C = 1$. 

As in Example 1, we compare the asymptotics of this minimax risk with the one over 
the class of plug-in estimators, where the latter can be easily computed by (5.2) and the 
results in [21]. Direct comparison reveals that the convergence rates of both minimax 
risks are $n^{-2\alpha/(2\alpha+1)}$ and the convergence constants can both be written in the form 
$C^{1/(2\alpha+1)}f(\alpha)$, where $f(\alpha)$ is a function depending only on $\alpha$. Although it is hard to 
obtain an explicit representation for the convergence constant for the overall minimax 
risk, our simulation result in Figure 1 shows that it is strictly smaller than that over the 
class of plug-in estimators. 
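
The constants $M$ and $K$ of Example 2 and the bracket in (5.4) can be explored numerically. The sketch below is an illustration under stated assumptions, not the simulation behind Figure 1: the function names are hypothetical, the series is truncated at a fixed number of terms, the binomial coefficients of $\sqrt{1+y}$ are generated by recursion instead of factorials, and the finite-$n$ risk is obtained by solving (3.8) with bisection for $m = n$.

```python
import numpy as np

def series_constants(alpha, C, terms=60):
    """Constants M and K of Example 2 from the truncated binomial series."""
    k = np.arange(terms)
    binom = np.ones(terms)                       # binom[k] = binomial(1/2, k)
    for j in range(1, terms):
        binom[j] = binom[j - 1] * (0.5 - (j - 1)) / j
    c = 2.0 * np.sqrt(2.0) * binom / 8.0**k      # coefficient of (i/N)^((2k-1)*alpha)
    tail = np.sum(c / ((2 * k + 1) * alpha + 1))
    M = (4.0 * C / (tail - 3.0 / (2 * alpha + 1))) ** (1.0 / (2 * alpha + 1))
    K = 1.0 + 1.0 / (2 * (2 * alpha + 1)) - 0.5 * tail
    return M, K

def scaled_linear_risk(alpha, C, n, iters=200):
    """n^(2*alpha/(2*alpha+1)) times the linear minimax risk when m = n."""
    i = np.arange(1, n + 1)
    a2 = (2.0 * np.ceil(i / 2.0)) ** (2 * alpha)   # a_{2i} = a_{2i-1} = (2i)^alpha, squared

    def t2(lam):                                    # prior variances for m = n
        return np.maximum(np.sqrt(1.0 + 8.0 * n * lam / a2) - 3.0, 0.0) / (4.0 * n)

    lo, hi = 0.0, 1.0
    while np.sum(a2 * t2(hi)) < C:                  # bracket the root of the constraint
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(a2 * t2(mid)) < C:
            lo = mid
        else:
            hi = mid
    th = t2(0.5 * (lo + hi))
    risk = np.sum(np.log1p(th / (1.0 / n + th))) / (2.0 * n)
    return n ** (2 * alpha / (2 * alpha + 1)) * risk

M, K = series_constants(alpha=1.0, C=1.0)
scaled = scaled_linear_risk(alpha=1.0, C=1.0, n=200_000)
```

For $\alpha = 1$ and $C = 1$ the scaled finite-$n$ risk should fall strictly between $\tfrac14KM$ and $\tfrac12KM$, in line with (5.4).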



Appendix: Proofs 



In this appendix, we provide the proofs of Theorem 2.1 and Lemma 4.1. 






Proof of Theorem 2.1. Let $\Psi$ be an $m\times m$ matrix whose $(i,j)$th entry equals $\phi_j(u_i)$. 
Since the $\phi_j$'s form an orthogonal basis for $L^2$ and the $u_i$'s are equally spaced, we have 
$\Psi'\Psi = mI_m$. Consider the transformation $\frac{1}{m}\Psi'\tilde Y$. Since the first $n$ columns of $\Psi$ are $\Phi_B$, 
the first $n$ elements of the transformed vector are just $\tilde X$, defined in (2.2), and we denote 
the remaining $(m-n)$ elements by $\tilde Z$. It is easy to check that $\tilde X|\theta \sim N_n(\theta,\frac{1}{m}I_n)$ and 
$\tilde Z \sim N_{m-n}(0,\frac{1}{m}I_{m-n})$ are independent multivariate Gaussian variables, and the target 
density function $p(\tilde y|f)$ satisfies 

$$p(\tilde y|f) = p(\tilde x,\tilde z|\theta)\,J_{\tilde x,\tilde z}(\tilde y), \tag{A.1}$$

where $J_{\tilde x,\tilde z}(\tilde y)$ is the Jacobian for this transformation. Similarly, any predictive density 
estimator $\hat p(\tilde y|y)$ can be rewritten as 

$$\hat p(\tilde y|y) = \hat p(\tilde x,\tilde z|x)\,J_{\tilde x,\tilde z}(\tilde y), \tag{A.2}$$

where $x$ is a transformation of $y$ defined in (2.2). Note that the two predictive density 
functions on the left and right sides of the above equation may have different functional 
forms; however, to simplify the notation, we use the same symbol $\hat p$ to represent them 
when the context is clear. 

Now, the average KL risk can be represented as 

$$R(f,\hat p) = \frac{1}{m}E_{Y,\tilde Y|f}\log\frac{p(\tilde Y|f)}{\hat p(\tilde Y|Y)} = \frac{1}{m}E_{X,\tilde X,\tilde Z|\theta}\log\frac{p(\tilde X,\tilde Z|\theta)}{\hat p(\tilde X,\tilde Z|X)}, \tag{A.3}$$

where the second equality follows from (A.1) and (A.2). Since $\tilde X$ and $\tilde Z$ are independent, 
we can split $p(\tilde x,\tilde z|\theta)$ as 

$$p(\tilde x,\tilde z|\theta) = p(\tilde x|\theta)\,p(\tilde z), \tag{A.4}$$

where $p(\tilde z)$ is the known density of $N_{m-n}(0,\frac{1}{m}I_{m-n})$. Moreover, to evaluate the minimax 
risk, it suffices to consider predictive density estimators of the form 

$$\hat p(\tilde x,\tilde z|x) = \hat p(\tilde x|x)\,p(\tilde z) \tag{A.5}$$

because any predictive density $\hat p(\tilde x,\tilde z|x)$ can be written as $\hat p(\tilde x,\tilde z|x) = \hat p(\tilde x|x)\,\hat p(\tilde z|x,\tilde x)$, and 
if $\hat p(\tilde z|x,\tilde x)$ is not equal to $p(\tilde z)$, then this density estimator is dominated by $\hat p(\tilde x|x)\,p(\tilde z)$, due 
to the non-negativity of KL divergence. 
Combining (A.3)-(A.5), we have 

$$R(f,\hat p) = \frac{1}{m}E_{X,\tilde X|\theta}\log\frac{p(\tilde X|\theta)}{\hat p(\tilde X|X)} = R(\theta,\hat p).$$

Consequently, the minimax risk in the non-parametric regression model is equal to the 
minimax risk in the Gaussian sequence model. □ 
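Proof of Lemma 4.1. Let $Q$ be the collection of all (generalized) Bayes predictive 
densities. Then, by [5], Theorem 5, $Q$ is a complete class for the problem of predictive 
density estimation under KL loss. Therefore, the minimax risk among all possible 
density estimators is equivalent to the minimax risk among (generalized) Bayes estimators, 
namely, 

$$R(\Theta) = \inf_{\hat p}\sup_{\theta\in\Theta} R(\theta,\hat p) = \inf_{\hat p\in Q}\sup_{\theta\in\Theta} R(\theta,\hat p).$$

Consider a Gaussian distribution $\pi_S = N(0,S)$, where $S = \mathrm{diag}(s_1^2,\dots,s_n^2)$ and the $s_i$'s 
satisfy condition (4.1). Then, 

$$R(\Theta) = \inf_{\hat p\in Q}\sup_{\theta\in\Theta} R(\theta,\hat p) \tag{A.6}$$

$$\ge \inf_{\hat p\in Q}\int_{\Theta} R(\theta,\hat p)\pi_S(\theta)\,d\theta$$

$$\ge \inf_{\hat p\in Q}\int_{\mathbb{R}^n} R(\theta,\hat p)\pi_S(\theta)\,d\theta - \sup_{\hat p\in Q}\int_{\Theta^c} R(\theta,\hat p)\pi_S(\theta)\,d\theta$$

$$\ge \inf_{\hat p}\int_{\mathbb{R}^n} R(\theta,\hat p)\pi_S(\theta)\,d\theta - \sup_{\hat p\in Q}\int_{\Theta^c} R(\theta,\hat p)\pi_S(\theta)\,d\theta. \tag{A.7}$$

The first term of (A.7) is the Bayes risk under $\pi_S$ over the unconstrained parameter 
space $\mathbb{R}^n$. It is achieved by the linear predictive density $\hat p_S$; see [1]. Therefore, 

$$\inf_{\hat p}\int_{\mathbb{R}^n} R(\theta,\hat p)\pi_S(\theta)\,d\theta = \int_{\mathbb{R}^n} R(\theta,\hat p_S)\pi_S(\theta)\,d\theta = \frac{n}{2m}\log\frac{v_n}{v_{n+m}} + \frac{1}{2m}\sum_{i=1}^n\log\frac{v_{n+m}+s_i^2}{v_n+s_i^2}. \tag{A.8}$$

To bound the second term of (A.7), note that for any Bayes predictive density $\hat p_\pi\in Q$, 

$$R(\theta,\hat p_\pi) = \frac{1}{m}E_{X,\tilde X|\theta}\log\frac{p(\tilde X|\theta)}{\int_\Theta p(\tilde X|\theta')\pi(\theta'|X)\,d\theta'}$$

$$\le \frac{1}{m}E_{X,\tilde X|\theta}\int_\Theta\log\frac{p(\tilde X|\theta)}{p(\tilde X|\theta')}\,\pi(\theta'|X)\,d\theta' \tag{A.9}$$

$$= \frac{1}{m}E_{X|\theta}\int_\Theta\frac{\|\theta-\theta'\|^2}{2v_m}\,\pi(\theta'|X)\,d\theta'$$

$$\le \frac{1}{mv_m}E_{X|\theta}\int_\Theta\big(\|\theta\|^2+\|\theta'\|^2\big)\pi(\theta'|X)\,d\theta' \tag{A.10}$$

$$\le \frac{1}{mv_m}\Big(\|\theta\|^2+\frac{C}{a_1^2}\Big), \tag{A.11}$$

where (A.9) is due to Jensen's inequality, (A.10) is due to $\|\theta-\theta'\|^2 \le 2\|\theta\|^2+2\|\theta'\|^2$ and 
(A.11) is due to 

$$\int_\Theta\|\theta'\|^2\pi(\theta'|x)\,d\theta' \le \sup_{\theta'\in\Theta}\|\theta'\|^2 \le \frac{1}{a_1^2}\sup_{\theta'\in\Theta}\sum_i a_i^2\theta_i'^2 \le \frac{C}{a_1^2}.$$

Therefore, 

$$\sup_{\hat p\in Q}\int_{\Theta^c} R(\theta,\hat p)\pi_S(\theta)\,d\theta \le \frac{1}{mv_m}\Big[\int_{\Theta^c}\|\theta\|^2\pi_S(\theta)\,d\theta + \frac{C}{a_1^2}\pi_S(\Theta^c)\Big], \tag{A.12}$$

where $\pi_S(\Theta^c) = \int_{\Theta^c}\pi_S(\theta)\,d\theta$. Using the Cauchy-Schwarz inequality, we can further 
bound the right-hand side of (A.12) as follows: 

$$\int_{\Theta^c}\|\theta\|^2\pi_S(\theta)\,d\theta + \frac{C}{a_1^2}\pi_S(\Theta^c) \le \sum_{i=1}^n\Big(\int\theta_i^4\pi_S(\theta)\,d\theta\Big)^{1/2}\sqrt{\pi_S(\Theta^c)} + \frac{C}{a_1^2}\pi_S(\Theta^c)$$

$$= \sqrt3\sum_{i=1}^n s_i^2\sqrt{\pi_S(\Theta^c)} + \frac{C}{a_1^2}\pi_S(\Theta^c) \le \frac{C}{a_1^2}\Big[\sqrt3\sqrt{\pi_S(\Theta^c)} + \pi_S(\Theta^c)\Big].$$

Then, by [3], Proposition 2, which gives a tail bound for $\sum_k(\varepsilon_k^2-\sigma_k^2)$ when $\varepsilon_1,\dots,\varepsilon_m$ are independent Gaussian 
random variables with $E\varepsilon_k = 0$ and $E\varepsilon_k^2 = \sigma_k^2$, we have 

$$\pi_S(\Theta^c) = P\Big\{\sum_{i=1}^n a_i^2\theta_i^2 > C\Big\} \le P\Big\{\sum_{i=1}^n a_i^2(\theta_i^2-s_i^2) > \Big(8a\log(1/v_n)\sum_{i=1}^n a_i^4s_i^4\Big)^{1/2}\Big\}, \tag{A.13}$$

where the inequality is due to condition (4.1) and the right-hand side is bounded by a positive power of $v_n$. 

Combining (A.7), (A.8), (A.12) and (A.13), the lemma then follows immediately. □ 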